A mixed-effects two-part model for twin-data and an application on identifying important factors associated with extremely preterm children’s health disorders

Our recent studies identifying factors significantly associated with the positive child health index (PCHI) in a mixed cohort of preterm-born singletons, twins, and triplets posed some analytic and modeling challenges. The PCHI transforms the total number of health disorders experienced (of the eleven ascertained) to a scale from 0 to 100%. While some of the children had none of the eleven health disorders (i.e., PCHI = 1), others experienced a subset or all (i.e., 0 ≤PCHI< 1). This indicates the existence of two distinct data processes—one for the healthy children, and another for those with at least one health disorder, necessitating a two-part model to accommodate both. Further, the scores for twins and triplets are potentially correlated since these children share similar genetics and early environments. The existing approach for analyzing PCHI data dichotomizes the data (i.e., number of health disorders) and uses a mixed-effects logistic or multiple logistic regression to model the binary feature of the PCHI (1 vs. < 1). To provide an alternate analytic framework, in this study we jointly model the two data processes under a mixed-effects two-part model framework that accounts for the sample correlations between and within the two data processes. The proposed method increases power to detect factors associated with disorders. Extensive numerical studies demonstrate that the proposed joint-test procedure consistently outperforms the existing method when the type I error is controlled at the same level. Our numerical studies also show that the proposed method is robust to model misspecifications and it is applicable to a set of correlated semi-continuous data.


Introduction
Approximately 10% of babies are born prematurely (< 37 weeks gestational age) worldwide every year [1]. Preterm birth is the leading cause of neonatal mortality and an important risk factor for developmental impairments including cognitive, behavioral, and social-emotional disorders [2][3][4][5][6][7]. Individuals born preterm also experience a higher risk of health disorders (e.g., asthma) and have a shorter life expectancy [8]. Our team has investigated 11 health disorders that are associated with preterm birth including bilateral blindness, hearing impairment, moderate/severe cognitive impairment, epilepsy, gross motor function impairment, attentiondeficit/hyperactivity disorder, anxiety, depression, asthma, autism, and obesity (i.e., body mass index above the 95 percentile) [9]. Further, preterm birth in the U.S. is associated with billions of dollars of annual societal economic burden [10]. On the other hand, some preterm children do not display major adverse health or developmental outcomes [9,11]. Investigating the outcomes of preterm-born children remains critical for evaluating and improving clinical care, planning long-term support and for advancing our understanding of the life-course consequences of immaturity at birth [7]. Identifying important factors that are associated with the risk and burden of such health disorders is important to both clinicians and researchers [12]. Indeed, determining important biomarkers associated with adverse health outcomes has long been recognized as a critical component in investigating disease etiologies, developing new therapeutic interventions, and accurately predicting disease progression [13,14]. However, the identification of such biomarkers remains a challenge.
The current research is motivated by recent epidemiological studies investigating the positive child health outcomes among 10-year-old children who were born extremely preterm [9,12]. One of our primary scientific interests in these studies was to identify important factors associated with the positive child health index (PCHI) outcome, which summates information about the presence or absence of 11 adverse health disorders at age 10 and transforms the cumulative number to a scale from 0 (the child experienced all the disorders) to 100% (no disorders), accounting for the number of non-missing responses [12]. That is, the fewer disorders the child experienced, the higher the child's PCHI score. Two aspects of the PCHI lead to analytic and modeling challenges when investigating associations between maternal antecedents, child's characteristics and PCHIs. The first involves the semi-continuous nature of the scale, which clearly reveals two distinct underlying data processes. While some children experienced none of these disorders (i.e., PCHI = 1), many others experienced a subset or all (i.e., PCHI < 1). That is, the PCHI, or alternatively the number of health disorder measurements, follows a mixture distribution, taking a boundary value under one condition, or a value arising from a continuous (ordinal) distribution under another. The second challenging feature of these data was that, in addition to singletons, it included twin and triplet clusters, within which PCHI measurements, or equivalently the number of health disorder experienced by each of the twins and triplets were correlated, as members of the same nuclear family share similar genetic structures and are exposed to comparable early environments. Even more challenging is the fact that these family-related cluster effects which within the two underlying data processes impact the propensity of incurring adverse health outcome and their subsequent burden, respectively, are themselves usually correlated with each other.
Previous research employed a strategy to identify important risk factors associated with the PCHI by dichotomizing the scale based on the number of disorders experienced (no any health disorder versus at least one health disorder, i.e., PCHI= 1 versus PCHI< 1) and using this binary outcome in a mixed-effects logistic regression with a random intercept to correlate children from a multiple birth [12]. Although this modeling scheme accommodates all the samples and is mathematically convenient, it can limit statistical power, since it is only capable of capturing the impact of risk factors on a binary health disorder status. However, in practice, these risk factors might also impact the disease burden, i.e., these risk factors may not only influence the propensity of developing health disorders but also impact the burden of disease (i.e., the number of health disorder experienced). More importantly, these risk factors usually impact both positively or negatively on the two data processes, i.e. in the same direction. Thus, jointly modeling the effects of these factors on the health outcomes can boost detection power. For independent samples, to jointly model this type of semi-continuous data, a two-part model can be adopted [15][16][17][18][19][20], where a logistic regression model is used to model the binary event of incurring no health disorders versus incurring at least one disorder, while a multiple linear (log-normal) regression is employed to model the association between the factors and the burden of health disorders (i.e., the nonzero data where the number of disorder experienced is converted to the proportion of the total 11 disorder, the larger the more severe). Special treatment is still required to account for the correlations among twin or triplet samples. To this end, we adopt the mixed-effects two-part modeling framework developed for longitudinal semi-continuous data [21]. Jointly, we model the correlated semi-continuous data by taking into account the sample correlations between and within the two underlying data processes. Under this framework, we derive a joint-test procedure for assessing each risk factor's effect on the correlated semi-continuous outcome measurements.
The remainder of the paper is arranged as follows. In Section 2, we provide a detailed description of the proposed mixed-effects two-part modeling framework for the correlated semi-continuous data. We then apply the proposed joint-test procedure to identify important factors associated with the health disorders for extremely preterm children in Section 3 to demonstrate its benefits over the existing method. We conduct extensive numerical studies under different data complexity scenarios in Section 4 to explore the applicability and performance as compared with the existing method and to demonstrate the generalizability and robustness of the proposed joint-test procedure for the correlated semi-continuous data. The paper concludes with some discussions in Section 5.

Joint-test procedure under mixed-effects two-part model for correlated semi-continuous data
Before providing more detailed explication of our proposed joint-test procedure, we first introduce some notations. Let X represent a matrix of all the observed clinical, demographic and biomarker variables, and Y represent an outcome vector of interest (e.g., the health disorder score), which can be zero or a continuous positive value. In the context of PCHI, equivalently, Y = 1 − PCHI where zero corresponds to experiencing no any adverse health issues while a positive value represents experiencing the proportion of 11 adverse health disorders (the larger the value, the more adverse outcomes experienced). We introduce another binary variable vector, Z, which is a dichotomization of the adverse health status, such that Z = I (Y > 0) where I is an indicator function. Our primary scientific interest is to model Pr(Y > 0 | X) and E[Y|Y > 0, X)] such that we can identify risk factors that are significantly associated with the propensity of being in disease status (i.e., Z = 1) and the burden of disease (i.e., Y > 0) when in disease state.
To model the association between the observed risk factors X and the adverse health status Z for the correlated data, we adopt a mixed-effects multiple logistic regression model, while the impacts of risk factors on the burden of the adverse health outcome are modelled with a log-normal distribution as follows, although other mixed-effects (e.g., linear or Poission) models can be employed. Our models can be written as follows: where x k,ij is the k th observed risk factor of the j th subject in family represents a random error term. The family-specific effects on the propensity to have at least one of the health disorders, and the proportion of having all health disorders (conditional on having at least one of them) are represented by u i and v i , respectively; to characterize the correlation between them, we assume they follow a bivariate normal distribution with mean zero as following: We can write down the total likelihood function as follows: where α = (α 0 , α 1 , � � �, α p ), β = (β 0 , β 1 , � � �, β p ), representing the risk factor effects on the propensity of developing at least one health disorder and the effects on the disease burden (e.g., characterized by the proportion of 11 health disorders involved) conditional on the subject developing the health disorder, respectively, and f (�) denotes a density function. Estimates for the parameters (α; β; s 2 ; s 2 u ; s 2 v ; r) can be obtained by maximizing the above likelihood function. However, this optimization could be challenging since the likelihood function involves two intractable integrals, which usually cannot be dealt with separately, except in the case of independent data [22]. Several strategies can be adopted, for example, the Fisher score approach via Laplace approximation [21], or the quasi-Newton optimization via adaptive Gaussian quadrature [23,24]. In this paper, we adopt a hybrid expectation maximization (EM) and quasi-Newton algorithm [25].
To test the k th hypothesis: pÞ, we construct the following test statistic: whereâ k andb k are the maximum likelihood estimates (MLEs) for parameters α k and β k , respectively, from Model (3). They follow a bivariate normal distribution with mean zero under the null as follows:â being the variance of the MLEs, respectively, and % being the correlation between these MLEs. S can be estimated by maximizing the likelihood function (3) as well. Test statistic z follows a χ 2 (2) distribution under the null.
In summary, the existing analysis method for this type of correlated semi-continuous PCHI data, i.e. Model (1), only models the relationship between the risk factors and the chance to be in positive health disorder status by dichotomizing PCHI scores (i.e., PCHI = 1 vs PCHI< 1). However, the risk factors usually not only influence the chance of being in health disorder status but also impact the disease burden. The proposed analysis method, i.e. Model (3), considers the risk factors effect on both the probability and burden of being in health disorder by incorporating Models (1) and (2). Furthermore, although this is a cross-sectional study (i.e. PCHI scores measured at age of 10-year), the proposed analysis method adopted mixed-effect models for longitudinal data to take into account the correlations between and within the twin's and triplet's PCHI (health disorder) measurements in the two data processes.

A practical application
To investigate the performance of the proposed method in practice, we apply it to the analysis of data from the Extremely Low Gestational Age Newborn (ELGAN) study, which has motivated this research. The ELGAN study is a prospective longitudinal observational cohort study. Study participants were enrolled at 14 hospitals in 5 states in the United States (Connecticut, Massachusetts, Illinois, Michigan, and North Carolina) between April 2002 and August 2004. The only inclusion criteria were birth at one of the enrollment hospitals and birth before 28 completed weeks of gestation [26,27]. The only exclusion criterion was anencephaly. Longitudinal follow-up of the 1198 surviving participants in the ELGAN cohort has included research clinic visits at 1, 2, 10, and 15 years of age. Children were diagnosed with cerebral palsy based on a standardized assessment at 2 years of age, adjusted for degree of prematurity [28]. All other assessments that were used to derive the positive child health index occurred when study participants were 10 years of age. Parent report was used to identify bilateral blindness and hearing impairment. Children's weight and height were measured by research coordinators at 10 years of age and body mass index (BMI) percentiles were calculated based on measured weight and height and age-and sex-specific US growth standards; obesity was defined as a BMI percentile � 95% [29]. Asthma diagnosis at age 10 years was based on parent or guardian report of a health care provider's diagnosis of asthma [30]. Here, we focus on estimating the impacts of sex, race, birth weight, maternal pre-pregnancy BMI, maternal smoking status during pregnancy, maternal age, maternal education level, histologic chorioamnionitis (a prenatal infection of the fetal membranes), pre-pregnancy maternal asthma status, and insurance status (public health insurance at birth) on the health disorders measured at age of 10 year-old for extremely preterm-born children, and identify the important factors associated with the health disorders. The data consist of 776 complete samples in total, of which 550 are singletons, 188 are twins (94 pairs), 33 are triplets (11 triplets), and 5 are quintuplets. The distributions of all observed clinical and demographic variables are summarized in Table 1. We note that nearly one third of the subjects (n = 251) reported no any health disorders at age 10.
We compare the performance of our proposed analysis method, i.e. the joint mixed-effects model (3) which jointly tests the effect of each feature using the test statistic (4) as described above, with the existing analysis method for PCHI data, i.e. mixed-effects logistic regression approach (1). In our analysis, pre-pregnancy asthma and histologic chorioamnionitis were included in the two-part model since they are potential confounding variables for the health disorders [12]. The parameter estimates and associated 95% confidence interval (CI) are presented in Table 2. As shown, at the 0.05 level of significance, the conventional approach detects sex, race, maternal pre-pregnancy BMI and public health insurance (rather than private insurance) as important risk factors, while the joint-test procedure identifies sex, birth weight, maternal pre-pregnancy BMI and public health insurance. However, after adjusting for multiple comparisons using the Hommel [31] or Bonferroni correction, only maternal pre-pregnancy BMI remains significant in the conventional method with an adjusted p-value of 0.008 (the adjusted p-values for sex, race and public health insurance become 0.177, 0.189, 0.115, respectively). Conversely, sex, maternal pre-pregnancy BMI and public health insurance remain significant in the joint-test procedure after this adjustment, with p-values of 0.010, 0.005, and 0.005, respectively. Although both methods detect pre-pregnancy BMI as a significant risk factor, we note that the adjusted p-value from the joint-test procedure is much smaller than that from the conventional method (0.005 vs 0.008). The consistency of corresponding pairs of parameter estimates from the mixed-effects two-part model is reassuring. That is, the estimates for sex (0.386;0.143), maternal pre-pregnancy BMI (0.346;0.053), and public health insurance (0.594;0.188), indicate that each influences the propensities of incurring health disorders and the health disorder burden (conditional on at least one health

PLOS ONE
disorder) in the same direction. Even though the parameter estimate (i.e., α) difference between existing method and the proposed method is small, joint-modeling includes the information from the second model (i.e. β). For example, using existing method, birth weight effect estimate is −0.143, which is not statistically significant. However, using the proposed jointmodeling method, the birth weight effect on the propensity of developing health disorder is estimated as −0.157, yet the birth weight is detected as a significant factor (p-value = 0.023) since the joint-modeling method considers the birth weight effect on the disease burden with a parameter estimate of −0.052. This provides intuition to understand why the joint modeling can potentially boost detection power compared with the conventional method. Also, in our analysis, treating pre-pregnancy BMI as a continuous or categorical variable won't change the overall conclusion.

Simulation study
The semi-continuous nature of the PCHI data is actually a relatively common feature of scientific and clinical data. For example, medical expenditures and length of hospital stay are two other examples that give rise to such data; in a given year some people accrue no medical expenses (or require no time in hospital) at all. Otherwise, these outcomes are represented by positive numbers (i.e., dollars or days). Examples of such data abound in economic studiesfor example, the size of an insurance claim will be a positive number when an incident occurs, and zero otherwise. Other examples include postoperative pain (POP) measured sometime post surgery; many surgery patients experience POP with varying intensity (i.e., the POP scores take positive values), while POP will completely resolve by the measurement occasion for many others (i.e., the POP scores are zero). An example with particular relevance to studies of extremely preterm babies is the duration of mechanical ventilation, which is 0 for a substantial proportion of infants and continuous positive values for others. We conducted extensive simulation studies to explore the applicability of the joint-test procedure under scenarios of varying data complexity. In our first scenario, we simulated the correlated data using a mixed-effects logistic regression model to generate the binary adverse health status z, and a log-normal linear mixed-effects model to generate the positive data
The random-effects terms u i and v i follow a bivariate normal distribution as follows:  Tables 3 and 4, respectively, controlling the type I error rate at the 0.05 level. The results in Tables 3 and 4 demonstrate that the joint-test procedure (i.e., the column labeled "Proposed Method") outperforms the traditional mixed-effects logistic regression method ("Existing Method") in detecting the important factors. For example, when n = 500 and ρ = 0, the power to detect the risk factors X 1 and W 1 (which influence both the binary events and the continuous positive outcomes) are markedly boosted from 0.298 and 0.332 to 0.859 and 0.880, respectively. Similar observations pertain to the other settings for varying n and ρ. Furthermore, when the traditional method fails to detect risk factors that only influence the continuous outcome, the joint-test procedure still can detect these factors with very high power. For instance, under the setting of n = 500 and ρ = 0.4, the traditional method fails to detect X 3 and W 3 (the negligible powers 0.059 and 0.059 are essentially equal to the nominal type I error rate), but the joint-test procedure has nearly full power to detect these two risk factors. In addition, the empirical type I error is appropriately controlled.
To further investigate the robustness of our two-part modeling approach to identify important biomarkers for correlated data, we conducted additional simulations using the following data generating models: logitðp ij Þ ¼ 0:8 þ 0:25x 1;ij þ 0:2w 1;ij þ 0:4x 2;ij þ 0:6w 2;ij þ u i y ij � Poissonð16 þ 0:65x 1;ij þ 0:85w 1;ij þ 0: where the causal risk factors, nuisance variables, and cluster effects settings are the same as in the above cluster log-normal scenario. Here, we still employ a multiple linear log-normal mixed-effects model for the positive portion of the data, but these data are now generated using a Poisson mixed-effects model. We ran 1000 Monte Carlo replications with a sample size n = 1000 and two different levels of correlation between the cluster effects (ρ = 0 and ρ = 0.4). The percentage of replications deeming the eight risk factors as statistically significant (while controlling the type I error rate at the 0.05) level are presented in Table 5. Again, the results in  Table 5 demonstrate that the joint-test procedure under two-part modeling ("Proposed Method") is much more powerful than the traditional one-part mixed-effects logistic regression model ("Existing Method") to detect risk factors while controlling the type I error rate at a reasonable level regardless of the correlation settings. Of note, even when the intensity model (i.e., the log-normal mixed-effects model) is misspecified for the positive ordinal data in the two-part modeling framework, the joint-test procedure still outperforms the traditional approach, further demonstrating its utility in clinical practice.

Discussion
In this study, we derived a joint-test procedure under a mixed-effects two-part modeling framework to identify important risk factors associated with the correlated semi-continuous outcomes. The application of the proposed method to the real PCHI data analysis has clearly demonstrated the advantages of the proposed method compared to the existing method for this type of semi-continuous data, i.e., it is more powerful to detect risk factors associated with the correlated semi-continuous outcomes. Further, this method is robust to model misspecification. In general, our proposed joint-test procedure is consistently more powerful than the traditional method while controlling the type I error rate at the same targeted level. The advantages of the proposed two-part model rooted from the fact that it jointly models the two data processes unlike the existing method (e.g., the censored or truncated regression) assumes one underlying data process for all subjects with a ceiling effect and thus is less flexible. These results manifestly demonstrate the advantages of the mixed-effects two-part model for the analysis of correlated semi-continuous data.
In pediatric practice, in addition to identifying important risk factors associated with crosssectional adverse health outcomes for preterm-born children, it is of clinical importance to investigate their associations with longitudinal health outcomes as well. It is of further practical importance to identify genetic variants that are associated with the health outcomes of these children. How to extend the proposed joint-test procedure to these even more complicated longitudinal correlated semi-continuous (and, in the latter case, high dimensional) settings warrants further investigations, but is beyond the scope of this paper.